Motivations

Nowadays, most customers will search for restaurants on Yelp to decide where to eat for the next meal. The review and rating of a restaurant is of great importance and will help a customer to decide whether this is a good place to eat. Our goal is to find what property makes a restaurant popular and what drives a customer to choose a certain restaurant. Therefore, we choose to explore data of Yelp since it has reviews from customers all over the world.  In this project, we have a map of restaurant showing the distribution of restaurants in different categories and some plots to reflect the relationship between customer flow(popularity) and category of a restaurant. We did some regression to help predict customer flow(popularity) of a restaurant. And  We also include some analysis of the review of a restaurant. This might help us to decide what a customer mostly looks for when choosing a restaurant.

Initial questions

The question we trying to answer at first is ‘what words are frequently mentioned in the negative and positive reviews?’ and ‘how does the category, location, opening hours and other attributes of a restaurant influence the customer flow?’  In the course of our analysis, we are interested in whether we could recommend a restaurant for a customer and give some advice to a restaurant on how to improve their business.

Data

The data used in this project is part of the Yelp Dataset Challenge. The dataset consists of a set of JSON files that include business information, reviews, tips, user information and check-ins. Variables in Business are business id, name, location, opening hours, category, average star rating, the number of reviews about the business and a series of attributes like noise level or reservations policy. Review objects contain star rating, the review text, the review date, and the number of votes that the review has received. In this project, we have focused on these two type of objects.  Source: https://drive.google.com/drive/folders/190oLdoSyZVnydl9Jxth8G7aKzu3bDgee?usp=sharing  The original review json: https://drive.google.com/file/d/1wbIRfISw2ZW0zy4bZGtnRyhUEOImi5_H/view?usp=sharing  The r code to clean the review json(we only extract 5000 reviews from this dataset because the original one is too large to process): https://drive.google.com/file/d/1EXrW_F_GmEq-tQlmtnYkMDwtt-IOA3tq/view?usp=sharing  Scraping method: We use the stream_in function in R package “Jsonlite” to read in the data. Then We filtered the business by category to keep only those businesses in the restaurant category and in state of “AZ” (10219). We only extract first 5000 reviews from the original review json.

Data clean

For business json: 1. We read in the json file 2. We extract price range from nested column “attributes” and get rid of other attributes 3. We obtain average opening hours of a restaurant on weekdays and weekends 4. We assign a category to a restaurant base on the frequency of the category.
For review json: 1. We read in the review csv 2. We broke review text into words 3.We remove meaningless words using stop word in R package Tidytext ## Business json ### Getting Data of Restaurant in AZ from business.json

Getting the opening hours of a restaurant in AZ

business_time =  business %>% 
  select(business_id) %>% 
  rowid_to_column()
hours = business$hours %>% 
  rowid_to_column()

businesstime = left_join(business_time, hours)%>%
  na.omit()%>%
  separate(Monday,into = c("open1","close1"), sep = "-")%>%
  mutate(open1 = str_replace(open1, ":", "."),
         close1 = str_replace(close1, ":", ".") )
## Joining, by = "rowid"
businesstime = businesstime%>%
  separate(Tuesday,into = c("open2","close2"), sep = "-")%>%
  mutate(open2 = str_replace(open2, ":", "."),
         close2 = str_replace(close2, ":", ".") )

businesstime = businesstime%>%
  separate(Wednesday,into = c("open3","close3"), sep = "-")%>%
  mutate(open3 = str_replace(open3, ":", "."),
         close3 = str_replace(close3, ":", ".") )

businesstime = businesstime%>%
  separate(Thursday,into = c("open4","close4"), sep = "-")%>%
  mutate(open4 = str_replace(open4, ":", "."),
         close4 = str_replace(close4, ":", ".") )

businesstime = businesstime%>%
  separate(Friday,into = c("open5","close5"), sep = "-")%>%
  mutate(open5 = str_replace(open4, ":", "."),
         close5 = str_replace(close4, ":", ".") )

businesstime = businesstime%>%
  separate(Saturday,into = c("open6","close6"), sep = "-")%>%
  mutate(open6 = str_replace(open6, ":", "."),
         close6 = str_replace(close6, ":", ".") )

businesstime = businesstime%>%
  separate(Sunday,into = c("open7","close7"), sep = "-")%>%
  mutate(open7 = str_replace(open7, ":", "."),
         close7 = str_replace(close7, ":", ".") )



library(magrittr)
## 
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
## 
##     set_names
## The following object is masked from 'package:tidyr':
## 
##     extract
cols = c(3:16)
businesstime[,cols] %<>% lapply(function(x) as.numeric(as.character(x)))

time = businesstime%>%
  mutate(mon = close1-open1,
         tue = close2 -open2,
         wed = close3-open3,
         thurs = close4 -open4,
         fri = close5-open5,
         sat = close6-open6,
         sun = close7-open7)

z=function(x){
  if(is.numeric(x)){
    ifelse(x<0,x+24,x)
  } 
}


a = time %>% select(-rowid, -business_id)
 b = map_df(a,z)

b=b%>%
  rowid_to_column()
c = time%>%
  select(rowid, business_id)

time = left_join(c, b) %>%
  mutate(weekdays = (mon+tue+wed+thurs+fri)/5,weekend= (sat+sun)/2) %>%
  select(weekdays,weekend,business_id)
## Joining, by = "rowid"
cate.dat = inner_join(rest.dat,time,by="business_id")
# cate.dat contains the restaurants in Arizona with variables in rest.dat and their opening hours on weekdays and weekend.

Assign a category to a restaurant

a rough preview of data

cate.dat = cate.dat %>%
  unnest(categories)

# get an overview of categories, for simplicity, we are only showing 50 categories
cate.dat %>%
  count(categories, sort = TRUE) %>%
  top_n(50)%>%
  mutate(categories = fct_reorder(categories, n)) %>% 
  ggplot(aes(x = categories, y = n)) + 
  geom_bar(stat = "identity", fill = "blue", alpha = .6) + 
  coord_flip()
## Selecting by n

We notice that: 1. ‘Restaurants’, ‘food’, ‘fast food’, ‘bars’, ‘sports bars’, ‘Event Planning & Serving’ gives no specific information of what a restaurant serves and will cause a restaurant to have several categories. Therefore, we need to remove them.
2. Category ‘bars’ and ‘nightlife’ are somewhat overlaid. Therefore, we remove ‘night life’.

Extract top 20 common categories for simplicity of analysis

# remove categories that do not provide useful information of what a restaurant serves
categories <- c('Restaurants','Food','Nightlife','Event Planning & Services','Sports Bars','Bars','Fast Food')
stop_cates = data.frame(categories) %>%
  mutate(categories = as.character(categories))
cate.dat1 = cate.dat
cate.dat = 
  anti_join(cate.dat, stop_cates)
## Joining, by = "categories"
# categories dataset contains top 20 common categories   
categories = cate.dat %>%
  count(categories, sort = TRUE) %>%
# omit categories with few restaurants, this is a trade-off because we might lose some restaurant, but we will get rid of some noise too
  top_n(20)
## Selecting by n
# get an overview of categories dataset
categories %>%
  mutate(categories = fct_reorder(categories, n)) %>% 
  ggplot(aes(x = categories, y = n)) + 
  geom_bar(stat = "identity", fill = "blue", alpha = .6) + 
  coord_flip()

# We could see that the remaining categories all provide specific information for a restaurant. There are most restaurants with category "American(traditional)" and numbers of restaurant of "Mexican", "Sandwiches" and "Pizza" are close. 

# final.dat dataset contains restaurants in top 20 common categories, but a restaurant might have multiple categories
final.dat = inner_join(categories,cate.dat) 
## Joining, by = "categories"
# assign the category with the highest frequency to each restaurant
final.dat = final.dat %>%
  group_by(business_id) %>%
  mutate(category=categories[1],n=n[1])%>%
  ungroup()%>%
  select(-categories)%>%
  unique()

# get an overview of categories in final.dat
final.dat %>%  
select(-n) %>%
count(category, sort = TRUE)
## # A tibble: 20 x 2
##                  category     n
##                     <chr> <int>
##  1 American (Traditional)  1057
##  2                Mexican   884
##  3             Sandwiches   803
##  4                  Pizza   586
##  5         American (New)   452
##  6                Chinese   402
##  7                Burgers   241
##  8     Breakfast & Brunch   186
##  9             Sushi Bars   138
## 10                Italian   131
## 11          Mediterranean    87
## 12                Seafood    85
## 13          Chicken Wings    72
## 14                  Cafes    70
## 15                  Salad    52
## 16           Asian Fusion    44
## 17            Steakhouses    33
## 18               Japanese    31
## 19           Coffee & Tea    27
## 20                  Delis    26

Conclusion of the data clean of business json

Comparing to the original dataset, in final.dat, each restaurant is assigned a category and opening hours on weekdays and weekends, the nested columns “attributes” and “opening hours” are removed. We will use for the visualization and analysis of data. For the order of common categories, the order of top 5 categories(American(Traditional), Mexican, Sandwiches, American(New)) remains unchanged. However, the latter ones changed because some restaurant have multiple categories and the less common ones are dropped.

Review json

Exploratory analysis of business json (Visualization)

A map of restaurants showing the distribution of categories

final.dat %>%
  mutate(text_label = str_c('Name:',name,"\nPostal code: ", postal_code, '\nAddress: ', address, '\nCategory: ', category)) %>% 
  plot_ly(x = ~longitude, y = ~latitude, type = "scatter", mode = "markers",
          alpha = 0.5, 
          color = ~category,
          text = ~text_label)
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors

## A map of sandwiches restaurants

final.dat %>%
filter(category=="Sandwiches") %>%
mutate(text_label = str_c('Name:',name,"\nPostal code: ", postal_code, '\nAddress: ', address, '\nCategory: ', category))%>%   
plot_ly(x = ~longitude, y = ~latitude, type = "scatter", mode = "markers",alpha = 0.5, color = ~category, text = ~text_label)
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels

### Analysis On x-axis is the longitude of a restaurant, on y-axis is the latitude of a restaurant. We visualize the distribution of restaurant and use different colour for different category. On the map, we can see that most restaurants serve American food, sandwiches, pizza and mexican food. Some restaurant might be masked because of the overlying of the location. Therefore, we then choose to show the map of sandwiches restaurant to get a vivid view of location. The restaurants tend to cluster around some point. This informs us that location of a restaurant might influence the business.This plot will serve as an interactivity plot on our website. Users could choose the category they want to try and restrict the range of rating and price.

A boxplot showing the distribution of review count of 10 most common categories

# We’re going to show only 10 categories with the most restaurants
common_category =
  final.dat %>% 
  count(category, sort = TRUE) %>% 
  top_n(10) %>% 
  select(category)
## Selecting by nn
## Selecting by n

inner_join(final.dat, common_category,
             by = "category") %>% 
   group_by(category) %>%  
   mutate(median_review = median(review_count)) %>%  
   ungroup()%>%  
   mutate(category = fct_reorder(category, median_review)) %>% 
  plot_ly(y = ~review_count, color = ~category, type = "box",
          colors = "Set2")

### Analysis From the box plot, we could obtain that restaurant serving burgers might have lowest number of review_count. The range of number of review_count of American(New) restaurants are biggest. The median of number of review of sushi bars ranks highest among the categories. Overall, most review_count are around 100. However, we could see many outliers around 500 - 1000 in each categories indicating some restaurants might enjoy great popularity.

A bar chart showing the mean of number of review_count in each category

final.dat%>%   
group_by(category) %>%  
mutate(mean_review = round(mean(review_count),digits=0)) %>%  
ungroup()%>%  
mutate(category = fct_reorder(category, mean_review)) %>%   
plot_ly(x = ~category, y = ~mean_review, color = ~category, type = "bar")
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors

### Analysis In this bar chart, we could see that Seafood and Sushi restaurants have most mean number of reviews. This is surprising because seafood restaurants is not among the most common restaurants. This might indicate that people tend to go to restaurants that are not so common around the area.

Bubble plot showing the distribution of ratings across different categories

#A grid of detailed average ratings by categories
#Basically, the average review rating scores in each state was reclassified from 1.0 to 5.0 by 0.5 increase. For visualization purpose, percentage of rating score is weighted.
dataWeighted_catestar <- final.dat %>% 
  group_by(category,stars) %>%
  summarise(totalByStar = n()) %>% arrange(desc(stars)) %>% 
  mutate(total = sum(totalByStar)) %>% mutate(percent = round((totalByStar / total)*100, 1)) %>%
  mutate(percentWeight = ifelse(percent >= 20, percent * 1.5, # custom column to weight the percent for size on the plot
                                ifelse(percent < 20 & percent >= 15, percent * 1, 
                                       ifelse(percent < 15 & percent >= 10, percent,
                                              ifelse(percent < 10 & percent >= 5, percent * 0.6, 1)))))


  library(ggplot2)
ggplot(dataWeighted_catestar, aes(x = category, y = stars, label = percent)) + 
    geom_point(aes(size = percentWeight * 0.8, colour = stars, alpha = 0.05)) + 
    geom_text(hjust = 0.4, size = 4) + scale_size(range = c(1, 30), guide = "none") + 
    scale_color_gradient(low = "darkblue", high = "red") + labs(title = "A grid of detailed avg.ratings by category ", 
    x = "Category", y = "Detailed Avg.Ratings") + scale_y_continuous(breaks = seq(1, 
    5, 0.5)) + theme(legend.title = element_blank(), axis.text.x = element_text(angle = 90, hjust = 1))

city <- final.dat %>% 
  group_by(city) %>% 
  summarise(totalCity = n()) %>% 
  top_n(10)
## Selecting by totalCity
dataWeighted_city <- left_join(city, final.dat) %>% 
  group_by(city,stars) %>%
  summarise(totalByStar = n()) %>% arrange(desc(stars)) %>% 
  mutate(total = sum(totalByStar)) %>% mutate(percent = round((totalByStar / total)*100, 1)) %>%
  mutate(percentWeight = ifelse(percent >= 20, percent * 1.5, # custom column to weight the percent for size on the plot
                                ifelse(percent < 20 & percent >= 15, percent * 1, 
                                       ifelse(percent < 15 & percent >= 10, percent,
                                              ifelse(percent < 10 & percent >= 5, percent * 0.6, 1)))))
## Joining, by = "city"
ggplot(dataWeighted_city, aes(x = city, y = stars, label = percent)) + 
    geom_point(aes(size = percentWeight * 0.8, colour = stars, alpha = 0.05)) + 
    geom_text(hjust = 0.4, size = 4) + scale_size(range = c(1, 30), guide = "none") + 
    scale_color_gradient(low = "blue", high = "green") + labs(title = "A grid of detailed avg.ratings by city ", 
    x = "City", y = "Detailed Avg.Ratings") + scale_y_continuous(breaks = seq(1, 
    5, 0.5)) + theme(legend.title = element_blank())

### Analysis Different colors represent different rating levels. The number in the bubble represents the precentage of the specific rating level and this percentage can also be explained by the bubble size.

We can see from the bubble plot that there are 10 kinds of restaurants and nine levels of ratings, seperately shown in the y-axis and x-axis. Most restaurants have main rating level are around 4.0 and 3.5.The breakfast,Italian and new restaurants have the biggest percentage in 4.0. The precentage of each rating level of Hamburger Restaurants are around 15%, so they have more even distrutions.For breakfast restaurants, they have biggest 4.5 ratings percentage and more than 70% resturants have rating higher than 3.0 . Thus, we can conclude breakfats restaurants have highest rating levels among these 10 kinds of restaurants. However, wings restaurants have lowest percentage in 3.5 and 4.0 rating levels and over 50% of ratings below or equal to 3.0. Thus, wings restaurants shows the lowest rating.

In the second bubble plot, we choose 10 cities with most restuarants. The ratings is concentrated between 3.0 and 4.0. Percentage of rating 4.5 is around 10% in most cities. And find the resturants in Scottsdale have biggestt bubble in rating 4.0, meaning the city has highest percentage of 4.0 rating. Besides, in the lower rating level such as 2.0 and 2.5, city Goodyear have biggest bubbles among ten cities. Although there are some differences in every rating levels among cities, the variations are very extreme. Therefore, it seems that the mean rating levels among these cities would be similar.

Exploratory statistical analysis of business json

Regression

linear = cate.dat1 %>%
  mutate(categories = str_replace(categories,"c\\(","")) %>%
  unnest_tokens(category, categories) %>%
  select(-neighborhood)

mod.categories = linear %>%
  count(category, sort = TRUE) %>%
  top_n(3)
## Selecting by n
mod.final = inner_join(mod.categories,linear) 
## Joining, by = "category"
mod.final = mod.final %>%
  dplyr::filter(city == "Phoenix"|city == "Scottsdale"|city =="Mesa")

model = mod.final %>%
  dplyr::select(stars, category,city,review_count, RestaurantsPriceRange2,weekdays,weekend) %>%
  dplyr::mutate(category = as.factor(category),
         city = as.factor(city))

par(mar=c(3,3,1,1)) 
par(mfrow=c(2,3))
attach(model)
## The following object is masked _by_ .GlobalEnv:
## 
##     city
hist(stars)
hist(review_count)
hist(weekdays)
hist(weekend)

#We find that review_count is not normally distributed, thus we transform it to log_review.
model1 = model %>%
  mutate(log_review = log(review_count)) %>%
  dplyr::select(-review_count)
par(mfrow = c(1,1))

ggplot(model1, aes(x = log_review)) + 
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

pairs(model1)

Before buidling models, we draw histograms check the assumption about the outcome ‘review_count’ and several predictors, and we find that review_count is not normally distributed. Thus we transform it into log_review and get relatively normal distribution. And we also draw the plot matrix for all variables to see wether have potentially linear association. And then we will continue to build our models.

model1 = model1%>%
  na.omit()%>%
  mutate(stars = as.factor(stars))

mult.fit = lm( log_review~ ., data=model1)
step(mult.fit, direction='backward')
## Start:  AIC=650.75
## log_review ~ stars + category + city + RestaurantsPriceRange2 + 
##     weekdays + weekend
## 
##                          Df Sum of Sq    RSS     AIC
## <none>                                3224.2  650.75
## - weekdays                1      4.40 3228.6  652.19
## - weekend                 1      5.11 3229.3  652.75
## - category                2     18.47 3242.7  661.16
## - city                    2     20.97 3245.2  663.11
## - RestaurantsPriceRange2  1    255.32 3479.5  841.02
## - stars                   8    517.53 3741.8 1010.33
## 
## Call:
## lm(formula = log_review ~ stars + category + city + RestaurantsPriceRange2 + 
##     weekdays + weekend, data = model1)
## 
## Coefficients:
##            (Intercept)                stars1.5                  stars2  
##                1.07864                 0.67123                 0.95726  
##               stars2.5                  stars3                stars3.5  
##                1.36086                 1.66508                 2.06183  
##                 stars4                stars4.5                  stars5  
##                2.30725                 2.31458                 0.40365  
##           categoryfood     categoryrestaurants             cityPhoenix  
##               -0.27613                -0.15985                 0.21890  
##         cityScottsdale  RestaurantsPriceRange2                weekdays  
##                0.26695                 0.61913                 0.01956  
##                weekend  
##               -0.01990
summary (mult.fit)
## 
## Call:
## lm(formula = log_review ~ ., data = model1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.3077 -0.7083  0.1351  0.8071  3.6779 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             1.07864    0.66487   1.622 0.104858    
## stars1.5                0.67123    0.67287   0.998 0.318591    
## stars2                  0.95726    0.66574   1.438 0.150591    
## stars2.5                1.36086    0.65929   2.064 0.039107 *  
## stars3                  1.66508    0.65786   2.531 0.011432 *  
## stars3.5                2.06183    0.65715   3.138 0.001723 ** 
## stars4                  2.30725    0.65702   3.512 0.000453 ***
## stars4.5                2.31458    0.65890   3.513 0.000451 ***
## stars5                  0.40365    0.68025   0.593 0.552982    
## categoryfood           -0.27613    0.07290  -3.788 0.000156 ***
## categoryrestaurants    -0.15985    0.06128  -2.608 0.009149 ** 
## cityPhoenix             0.21890    0.06176   3.544 0.000401 ***
## cityScottsdale          0.26695    0.07007   3.810 0.000142 ***
## RestaurantsPriceRange2  0.61913    0.04394  14.090  < 2e-16 ***
## weekdays                0.01956    0.01058   1.849 0.064552 .  
## weekend                -0.01990    0.00998  -1.994 0.046309 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.134 on 2507 degrees of freedom
## Multiple R-squared:  0.2636, Adjusted R-squared:  0.2592 
## F-statistic: 59.83 on 15 and 2507 DF,  p-value: < 2.2e-16
#lm(formula = log_review ~ stars + category + city + RestaurantsPriceRange2 + weekdays + weekend, data = model1)

mult.fit1 = lm(log_review ~ stars + category + city + RestaurantsPriceRange2 + weekdays + weekend, data=model1)
summary(mult.fit1)
## 
## Call:
## lm(formula = log_review ~ stars + category + city + RestaurantsPriceRange2 + 
##     weekdays + weekend, data = model1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.3077 -0.7083  0.1351  0.8071  3.6779 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             1.07864    0.66487   1.622 0.104858    
## stars1.5                0.67123    0.67287   0.998 0.318591    
## stars2                  0.95726    0.66574   1.438 0.150591    
## stars2.5                1.36086    0.65929   2.064 0.039107 *  
## stars3                  1.66508    0.65786   2.531 0.011432 *  
## stars3.5                2.06183    0.65715   3.138 0.001723 ** 
## stars4                  2.30725    0.65702   3.512 0.000453 ***
## stars4.5                2.31458    0.65890   3.513 0.000451 ***
## stars5                  0.40365    0.68025   0.593 0.552982    
## categoryfood           -0.27613    0.07290  -3.788 0.000156 ***
## categoryrestaurants    -0.15985    0.06128  -2.608 0.009149 ** 
## cityPhoenix             0.21890    0.06176   3.544 0.000401 ***
## cityScottsdale          0.26695    0.07007   3.810 0.000142 ***
## RestaurantsPriceRange2  0.61913    0.04394  14.090  < 2e-16 ***
## weekdays                0.01956    0.01058   1.849 0.064552 .  
## weekend                -0.01990    0.00998  -1.994 0.046309 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.134 on 2507 degrees of freedom
## Multiple R-squared:  0.2636, Adjusted R-squared:  0.2592 
## F-statistic: 59.83 on 15 and 2507 DF,  p-value: < 2.2e-16
anova(mult.fit1)
## Analysis of Variance Table
## 
## Response: log_review
##                          Df Sum Sq Mean Sq  F value    Pr(>F)    
## stars                     8  755.6  94.447  73.4378 < 2.2e-16 ***
## category                  2   96.2  48.095  37.3964 < 2.2e-16 ***
## city                      2   41.3  20.636  16.0459  1.19e-07 ***
## RestaurantsPriceRange2    1  255.9 255.943 199.0090 < 2.2e-16 ***
## weekdays                  1    0.1   0.110   0.0855   0.77005    
## weekend                   1    5.1   5.111   3.9743   0.04631 *  
## Residuals              2507 3224.2   1.286                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#However, we find that in significant, p-value of weekend variable is greater than 0.05.

mult.fit3 = lm(log_review ~ stars + category + city + RestaurantsPriceRange2 + weekdays, data=model1)
anova(mult.fit3,mult.fit1)
## Analysis of Variance Table
## 
## Model 1: log_review ~ stars + category + city + RestaurantsPriceRange2 + 
##     weekdays
## Model 2: log_review ~ stars + category + city + RestaurantsPriceRange2 + 
##     weekdays + weekend
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1   2508 3229.3                              
## 2   2507 3224.2  1    5.1112 3.9743 0.04631 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
par(mar=c(3,3,1,1)) 
par(mfrow=c(2,2))
plot(mult.fit3)

res<-t.test(model1$weekdays, model1$weekend, var.equal=FALSE, paired=FALSE)               # var.equal=FALSE is the default, so no need to specifically write it.
res
## 
##  Welch Two Sample t-test
## 
## data:  model1$weekdays and model1$weekend
## t = 8.0697, df = 5028.1, p-value = 8.738e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.7228649 1.1867902
## sample estimates:
## mean of x mean of y 
## 10.375973  9.421145

Analysis

In this linear model, we consider reviews as the outcome and try to find the associations among those variables. However, we find the review variable is not normally distributed, so we transfrom it into log_review. At first, we clean the linear data and choose three most categories—bars, restaurants and food. The we use backward stepwise to build the model. Then, we get “log_review ~ stars + category + city + RestaurantsPriceRange2 + weekdays + weekend” model. In this model, city stars and catogory predictors are categorical and “weekdays”,“weekends” are continuous. However, in summary and general ANOVA Test, we find the coefficient of weekend are not significant. In this way, we use partial anova test, we assume that the null hypothesis is “log_review ~ stars + category + city + RestaurantsPriceRange2 + weekdays” and find the p-value is greater than 0.05 and we fail to reject null hypothesis and choose the small model. And this also satisfies “parsimony”.

In addition, through t test, we test the difference between business hours of weekdays and weekends and the results shows that their business hours are significantly different.

From the diagnostic plots, we can see that the model satisifes the linear assumption and doesn’t find many influential outliers.

Exploratory statistical analysis of review csv

explore words

review <- read_csv("../data/review_sample_5000.csv") %>% 
  clean_names() %>% 
  select(business_id, everything()) %>% 
  nest(x1:cool)
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
##   X1 = col_integer(),
##   review_id = col_character(),
##   user_id = col_character(),
##   business_id = col_character(),
##   stars = col_integer(),
##   date = col_date(format = ""),
##   text = col_character(),
##   useful = col_integer(),
##   funny = col_integer(),
##   cool = col_integer()
## )
businessId = business.dat %>%
  mutate(categories = as.character(categories)) %>%
  filter(str_detect(categories, "Restaurants")) %>% 
  select(business_id)
review_business = inner_join(businessId, review, by  = "business_id") %>% 
  unnest()

Regression

Analysis

review_length = review_business %>%
  mutate(text = as.character(text),
         inspection_num = row_number()) %>%
  unnest_tokens(word, text) %>% 
  group_by(inspection_num) %>% 
  count() 
#split text, remove stop words
data("stop_words")
inspection_words = review_business %>%
  mutate(text = as.character(text),
         inspection_num = row_number()) %>% 
  unnest_tokens(word, text)
inspection_words = 
  anti_join(inspection_words, stop_words)
## Joining, by = "word"
bing_sentiments = get_sentiments("bing")
#neg and pos numbers
inspection_sentiments = inspection_words %>% 
  inner_join(., bing_sentiments) %>% 
  count(inspection_num, sentiment) %>% 
  spread(sentiment, n, fill = 0) 
## Joining, by = "word"
star = review_business %>%
  mutate(text = as.character(text),
         inspection_num = row_number()) %>% 
  select(inspection_num, stars)

inspection_lm = inner_join(star, inspection_sentiments, by = "inspection_num") %>% 
  inner_join(., review_length)
## Joining, by = "inspection_num"

fit a linear model

reviews_reg = lm(stars ~ n + positive + negative, data = inspection_lm)
summary(reviews_reg)
## 
## Call:
## lm(formula = stars ~ n + positive + negative, data = inspection_lm)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0841 -0.7394  0.1255  0.8826  5.2862 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.6093219  0.0293202 123.100  < 2e-16 ***
## n           -0.0017103  0.0003171  -5.394 7.34e-08 ***
## positive     0.1612451  0.0067647  23.836  < 2e-16 ***
## negative    -0.2104708  0.0091261 -23.062  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.103 on 3499 degrees of freedom
## Multiple R-squared:  0.302,  Adjusted R-squared:  0.3015 
## F-statistic: 504.7 on 3 and 3499 DF,  p-value: < 2.2e-16

For every positive word, the predicted average star rating given is increased by 0.16 on average (e.g. 16 positive words indicate a 1-star increase) For every negative word, the predicted average star rating given is decreased by 0.21 on average (e.g. 21 negative words indicate a 1-star decrease) This model explains 30.1% of the variation in the number of stars given in a review. This sounds like a low percentage, but is impressive for such a simple model using unstructured real-world data.

Exploratory analysis of review csv (visulization)

bar chart & density plot

Positivity Bar chart

inspection_lm %>%
  filter(positive != 0) %>%
  mutate(stars = factor(stars),
         percent_pos = round((positive)/n,digits = 2)) %>%
  select(stars, percent_pos) %>% 
  group_by(percent_pos, stars) %>% 
  count() %>% 
 ggplot(aes(x = percent_pos, y = n, fill = stars)) +
  geom_col() +
  scale_x_continuous(labels = scales::percent) +
  scale_y_continuous(labels = scales::comma) +
  #theme_custom()+
  labs(title=paste("Yelp Review Positivity, by # of Stars for", format(nrow(inspection_lm),big.mark=","),"Reviews"), x="% Review Positivity (# Positive Words : # Words)", y="Total # of Reviews") +
  scale_fill_manual(breaks = c("1", "2", "3", "4", "5"), 
                       values = c("burlywood2", "violet", "violetred1", "violetred3", "deeppink4"))

Positivity Density

inspection_lm %>%
  filter(positive != 0) %>%
  mutate(s =stars,
    percent_pos = round((positive)/n,digits = 2)) %>%
ggplot(aes(x=percent_pos, fill=as.factor(stars), color=as.factor(stars), y=..density..)) +
  geom_density(alpha = 1, position="fill") +
  scale_x_continuous(limits=c(0,0.43), label = scales::percent) +
  scale_y_continuous(label = scales::comma) +
  theme(legend.title = element_blank(), legend.position="top", legend.direction="horizontal", legend.key.width=unit(0.5, "cm"), legend.key.height=unit(0.25, "cm"), legend.margin=unit(-0.5,"cm"), panel.margin=element_blank()) +
  labs(title=paste("Yelp Positivity Proportion, by # Stars for", format(nrow(inspection_lm),big.mark=","),"Reviews"), x="% Review Positivity (# Positive Words : # Words)", y="Proportion of Reviews") +
  scale_fill_manual(values=c("burlywood2", "violet", "violetred1", "violetred3", "deeppink4"), labels = c("1 Star", "2 Stars", "3 Stars", "4 Stars", "5 Stars")) +
  scale_color_manual(values=c("burlywood2", "violet", "violetred1", "violetred3", "deeppink4"), labels = c("1 Star", "2 Stars", "3 Stars", "4 Stars", "5 Stars")) 
## Warning: `panel.margin` is deprecated. Please use `panel.spacing` property
## instead
## Warning: `legend.margin` must be specified using `margin()`. For the old
## behavior use legend.spacing

This histogram of positivity scores shows that 1-star reviews have lower positivity compared to 4 or 5-star reviews. For those reviews whose positivity higher than 10%, The distribution for each star rating is close to a Normal distribution, with each successive rating category peaking at increasing positivity values. And the relative proportion of each star rating reinforces this.

Negativity Bar Chart

inspection_lm %>%
  filter(negative != 0) %>%
  mutate(stars = factor(stars),
         percent_neg = round((negative)/n,digits = 2)) %>%
  select(stars, percent_neg) %>% 
  group_by(percent_neg, stars) %>% 
  count() %>% 
 ggplot(aes(x = percent_neg, y = n, fill = stars)) +
  geom_col() +
  scale_x_continuous(labels = scales::percent) +
  scale_y_continuous(labels = scales::comma) +
  #theme_custom()+
  labs(title=paste("Yelp Review Negativity, by # of Stars for", format(nrow(inspection_lm),big.mark=","),"Reviews"), x="% Review Negativity (# Negative Words : # Words)", y="Total # of Reviews") +
  scale_fill_manual(breaks = c("1", "2", "3", "4", "5"), 
                       values = c("burlywood2", "violet", "violetred1", "violetred3", "deeppink4"))

Negativity Density

inspection_lm %>%
  filter(negative != 0) %>%
  mutate(s =stars,
    percent_neg = round((negative)/n,digits = 2)) %>%
ggplot(aes(x=percent_neg, fill=as.factor(stars), color=as.factor(stars), y=..density..)) +
  geom_density(alpha = 1, position="fill") +
  scale_x_continuous(limits=c(0,0.35), label = scales::percent) +
  scale_y_continuous(label = scales::comma) +
  theme(legend.title = element_blank(), legend.position="top", legend.direction="horizontal", legend.key.width=unit(0.5, "cm"), legend.key.height=unit(0.25, "cm"), legend.margin=unit(-0.5,"cm"), panel.margin=element_blank()) +
  labs(title=paste("Yelp Negativity Proportion, by # Stars for", format(nrow(inspection_lm),big.mark=","),"Reviews"), x="% Review Negativity (# Negative Words : # Words)", y="Proportion of Reviews") +
  scale_fill_manual(values=c("burlywood2", "violet", "violetred1", "violetred3", "deeppink4"), labels = c("1 Star", "2 Stars", "3 Stars", "4 Stars", "5 Stars")) +
  scale_color_manual(values=c("burlywood2", "violet", "violetred1", "violetred3", "deeppink4"), labels = c("1 Star", "2 Stars", "3 Stars", "4 Stars", "5 Stars")) 
## Warning: `panel.margin` is deprecated. Please use `panel.spacing` property
## instead
## Warning: `legend.margin` must be specified using `margin()`. For the old
## behavior use legend.spacing

The histogram of negative reviews looks much different than the positive one. The chart is heavily skewed right. We can see that 5-star reviews aren’t completely positive all the time, though they have lower negativity than the 1 star reviews.

At low negativity, the proportions of negative review scores (1-star, 2-stars, 3-stars) and positive review scores (4-stars, 5-stars) are about equal, implying that negative reviews can be just as civil as positive reviews. And then the proportions of negative review scores (1-star, 2-stars, 3-stars) increase as the negativity increase. High negativity presents mostly in 1-star and 2-star reviews, but also contains 4-star and 5-star reviews. Even 5-star reviews aren’t completely positive all the time.

word cloud

count_stop_words <- function(x) {
    word_array <- strsplit(as.character(x)," ")[[1]]
     
    return (sum(word_array %in% stop_words))
}

inspection_words = inspection_words%>% 
  filter(lapply(word, count_stop_words) < 1)

star1 = inspection_words %>% filter(stars == 1)
star1_word = star1 %>%
  group_by(word)%>%
  count(word,sort = TRUE)
star5 = inspection_words %>% filter(stars == 5)
star5_word = star5 %>%
  group_by(word)%>%
  count(word,sort = TRUE)

 library(wordcloud)
pal <- brewer.pal(9, "Reds")
pal <- pal[-c(1:3)]
png(filename = "words-1star.png", width = 3000, height = 3000, res= 300)
wordcloud(toupper(star1_word$word), star1_word$n, scale=c(11,.1), random.order=F, rot.per=.10, max.words=5000, colors=pal, random.color=T)

dev.off()

pal <- brewer.pal(9, "Greens")
pal <- pal[-c(1:3)]
png(filename = "words-5star.png", width = 3000, height = 3000, res= 300)
wordcloud(toupper(star5_word$word), star5_word$n, scale=c(9,.1), random.order=F, rot.per=.10, max.words=5000, colors=pal, random.color=T)

dev.off()
library(png)
image_1star = readPNG("../data/words-1star.png") 
image_5star = readPNG("../data/words-5star.png") 

We analyzed the most frequently used words in 5000 Yelp Reviews by using the word cloud function. When comparing the results between 5-star reviews and 1-star reviews, the difference is apparent. The 5-star reviews contain many instances like “love”, “amazing”, “delicious”, “friendly”. On the contrast, the 1-star reviews use very little positive language, and instead using words like “minutes”, “horrible”, ”worst”, “frozen”, presumably after long and unfortunate waits at the establishment, and the food were not good either. What’s more, comparing the two review word clouds, the 5-star reviews contain more adjectives, while the 1-star reviews contains more nouns. It seems that people tend to show strong affection when the food is good, and try to warn for potential customers and complain for the issues when the experience is horrible.

donut charts

 data_reviews_agg <- inspection_lm %>%
    group_by(stars) %>%
    summarize(count = n()) %>%
    mutate(fraction = count / sum(count),
            ymax = cumsum(fraction),
            ymin = c(0, head(ymax, n=-1)))
rank_colors = c("burlywood2", "violet", "violetred1", "violetred3", "deeppink4")
 
ggplot(aes(fill=as.factor(stars), color=as.factor(stars), ymax=ymax, ymin=ymin, xmax=5, xmin=4), data=data_reviews_agg) +
    geom_rect(color="white") +
     coord_polar(theta="y") +
     annotate("text", label = paste(format(data_reviews_agg$fraction * 100, digits=2),"%",sep=''), x=rep(6,5), y=(data_reviews_agg$ymin + data_reviews_agg$ymax)/2, col=rank_colors, size=3) +
     xlim(c(0, 6)) +
 theme(panel.grid=element_blank(), axis.text=element_blank(), axis.ticks=element_blank(), panel.background=element_blank(), axis.title.x = element_blank(), axis.title.y=element_blank(),legend.title = element_blank(),  legend.key.height=unit(0.25, "cm"), legend.margin=unit(-0.5,"cm"),panel.border= element_blank()) +
     scale_fill_manual(values=rank_colors, labels = c("1 Star", "2 Stars", "3 Stars", "4 Stars", "5 Stars")) +
  scale_color_manual(values=rank_colors, labels = c("1 Star", "2 Stars", "3 Stars", "4 Stars", "5 Stars")) +
     annotate("text", x = 0, y = 0, label = "Customized ring plot", col="#1a1a1a", family="Source Sans Pro Light", size=11)
## Warning: `legend.margin` must be specified using `margin()`. For the old
## behavior use legend.spacing
## Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x
## $y, : no font could be found for family "Source Sans Pro Light"

## Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x
## $y, : no font could be found for family "Source Sans Pro Light"

## Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x
## $y, : no font could be found for family "Source Sans Pro Light"

Conclusion of the reviews

Conclusion:From all the analysis above, we can conclude that Yelp reviews’ star ratings are are significantly associated. Yelp reviews with 5-star are generally positive while Yelp reviews with 1-star are generally negative. And people having great dining experience are more likely to write reviews on Yelp. Meanwhile, people having bad dining experience are more likely to write longer reviews to complain or warn the potential customers. Language plays a huge role in determining the ratings of reviews, and also knowledge could be applied to many other industries and review websites.

Discussion

Through linear regression we find that the passenger flows shown by ‘review_count’ are affected by stars, cities, categories, prices and their working hours. And from the estimated coefficients, we find stars and price have obviously positively linear relationship with passengers flows. Surprisely, the working hours in weekends are negatively associated with passenger flows, although the association is not very strong.Thus, the customers are more likely to choose the restaurants with higher stars and resonably higher prices. This may be because they would like to spend more money on better restaurants. And for business owners, we may suggest that they don’t need to pay attention to the opening hours, although there’re some relationships between business hours and passenger flows. In stead, they should improve their customers exprience to receive feedbacks with higher stars.

Usefulness of the results

Words that are frequently mentioned in reviews on Yelp might reflect what people most care when going to a restaurant. This is of great imporance as it helps restaurants to put more efforts on certain aspects and improve their business. Moreover, the words that are frequently mentioned in negative reviews can warn restaurant oweners what they should pay special attention to like the environment and hygiene situation of their restaurant.
The map of the restaurant can help customer to choose the restaurants base on their preferness. They could restrict the category and location of restaurants and try the ones with high ratings and ones they have not tried before. This would be implemented on our website.